Programming Massively Parallel Processors: A Hands-on Approach: The CUDA Execution Model: Host vs. Device

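The CUDA execution model turns your computer into a high-performance heterogeneous system. Picture a commander-in-chief (the host, i.e. the CPU) and an army of thousands (the device, i.e. the GPU). The commander handles complex logic and decision-making, while the army carries out large numbers of repetitive tasks at the same time.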

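1. Architectural Differences

The host is a low-latency CPU optimized for complex control flow and serial tasks. The device, by contrast, is a GPU designed for high throughput: it contains thousands of simple cores that can execute the same instruction across a large dataset simultaneously.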

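2. Execution Flow

A CUDA program runs as a sequence of phases. Execution begins on the host, which handles the "serial code". When the program reaches a "parallel kernel", the host launches a grid of threads on the device. Once the device has finished its massively parallel work, control returns to the host.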
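The phases described above can be sketched as a small CUDA program. This is a minimal illustration, not code from the course: the names `squareKernel` and `squareOnDevice`, the array size, and the choice of 256 threads per block are all assumptions made for the example. It shows a serial host phase, a kernel launch over a grid of threads, and the return of control to the host.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Parallel kernel (hypothetical name): each thread squares one element.
__global__ void squareKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {                                    // guard the partial last block
        data[i] = data[i] * data[i];
    }
}

// Hypothetical helper that runs the host -> device -> host phases.
// Returns 0 on success, 1 on any CUDA error.
int squareOnDevice(float* hostData, int n) {
    size_t size = n * sizeof(float);
    float* devData = nullptr;

    // Host phase: allocate device memory and copy the input over.
    if (cudaMalloc((void**)&devData, size) != cudaSuccess) return 1;
    if (cudaMemcpy(devData, hostData, size, cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(devData);
        return 1;
    }

    // Device phase: launch a grid with enough blocks to cover n elements.
    int threadsPerBlock = 256;  // assumed block size
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    squareKernel<<<blocks, threadsPerBlock>>>(devData, n);

    // The launch itself is asynchronous; wait here so that the host
    // continues only after the device has finished.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    // Host phase again: copy the result back and release device memory.
    if (err == cudaSuccess) {
        err = cudaMemcpy(hostData, devData, size, cudaMemcpyDeviceToHost);
    }
    cudaFree(devData);
    return err == cudaSuccess ? 0 : 1;
}

int main() {
    const int n = 1000;
    float h[n];
    for (int i = 0; i < n; ++i) h[i] = (float)i;  // serial code on the host

    if (squareOnDevice(h, n) != 0) {
        fprintf(stderr, "CUDA call failed\n");
        return 1;
    }
    printf("h[3] = %.1f\n", h[3]);  // 3 squared
    return 0;
}
```

Compiling with `nvcc` and running on a machine with a CUDA-capable GPU should exercise all three phases. The explicit `cudaDeviceSynchronize()` is included only to make the hand-back to the host visible; the blocking `cudaMemcpy` that follows would also wait for the kernel.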

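3. Performance Specialization

This model draws on the strengths of both processors: the CPU manages system resources and complex branching, while the GPU executes SPMD (single program, multiple data) logic, processing data elements in parallel.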
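To make the SPMD idea concrete, here is a hedged sketch in which every thread runs the same kernel body but uses its own index to select a different data element. The SAXPY operation (`y = a * x + y`), the names `saxpyKernel` and `saxpy`, and the block size of 128 are assumptions chosen for illustration; they do not come from the lesson itself.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Single program: every thread executes this same function.
// Multiple data: each thread derives a distinct index i and
// therefore touches a distinct element of x and y.
__global__ void saxpyKernel(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}

// Hypothetical host wrapper: the CPU manages memory and the launch,
// the GPU does the element-wise work. Returns 0 on success.
int saxpy(int n, float a, const float* hx, float* hy) {
    size_t size = n * sizeof(float);
    float *dx = nullptr, *dy = nullptr;

    if (cudaMalloc((void**)&dx, size) != cudaSuccess) return 1;
    if (cudaMalloc((void**)&dy, size) != cudaSuccess) {
        cudaFree(dx);
        return 1;
    }

    cudaError_t err = cudaMemcpy(dx, hx, size, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        err = cudaMemcpy(dy, hy, size, cudaMemcpyHostToDevice);
    }

    if (err == cudaSuccess) {
        int threadsPerBlock = 128;  // assumed block size
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        saxpyKernel<<<blocks, threadsPerBlock>>>(n, a, dx, dy);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) {
        // Blocking copy: waits for the kernel before reading the result.
        err = cudaMemcpy(hy, dy, size, cudaMemcpyDeviceToHost);
    }

    cudaFree(dx);
    cudaFree(dy);
    return err == cudaSuccess ? 0 : 1;
}

int main() {
    const int n = 1 << 10;
    float x[n], y[n];
    for (int i = 0; i < n; ++i) {
        x[i] = (float)i;
        y[i] = 1.0f;
    }

    if (saxpy(n, 2.0f, x, y) != 0) {
        fprintf(stderr, "CUDA call failed\n");
        return 1;
    }
    printf("y[5] = %.1f\n", y[5]);  // 2 * 5 + 1
    return 0;
}
```

The bounds check `if (i < n)` matters because the grid is rounded up to a whole number of blocks, so some threads in the last block may have no element to process.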

QUESTION 1

Which architecture is characterized as being 'throughput-optimized'?

The Host (Intel® CPU)

The Device (NVIDIA® GPU)

The System RAM

The PCIe Bus

QUESTION 2

To complete Part 1 of the MatrixMultiplication() example in Figure 3.6, the reader must declare the Nd and Pd pointer variables and make their corresponding cudaMalloc() calls; Part 3 must then release that device memory. Which code fragment does this correctly?

float *Nd, *Pd; cudaMalloc((void**)&Nd, size); ... cudaFree(Nd);

float Nd, Pd; malloc(&Nd, size); ... free(Nd);

float *Nd, *Pd; cudaMemcpy(Nd, Pd, size); ... delete Nd;

int Nd, Pd; Nd = new float[size]; ... free(Nd);

QUESTION 3

In the CUDA execution model, where does a program always begin its execution?

On the Device (GPU)

Simultaneously on both

On the Host (CPU)

In the Global Memory

QUESTION 4

What happens when the Host encounters a phase with rich data parallelism?

It speeds up its clock frequency.

It launches a Kernel onto the Device.

It stores the data in the Host Cache.

It converts the code to Python.

QUESTION 5

A student attempts to launch a 1024x1024 matrix multiplication on G80 hardware using 1024 blocks, where each thread calculates one element. Why will this fail?

The G80 cannot handle 1024 blocks.

The total number of threads exceeds 1 million.

The configuration results in 1024 threads per block, exceeding the 512 hardware limit.

Matrix multiplication is not data parallel.